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Abstract. The edit distance problem is a classical fundamental problem in computer 
science in general, and in combinatorial pattern matching in particular. The standard 
dynamic-programming solution for this problem computes the edit-distance between a 
pair of strings of total length O(N) in 0(N 2 ) time. To this date, this quadratic upper- 
bound has never been substantially improved for general strings. However, there are known 
techniques for breaking this bound in case the strings are known to compress well under 
a particular compression scheme. The basic idea is to first compress the strings, and then 
to compute the edit distance between the compressed strings. 

As it turns out, practically all known o(N 2 ) edit-distance algorithms work, in some 
sense, under the same paradigm described above. It is therefore natural to ask whether 
there is a single edit-distance algorithm that works for strings which are compressed under 
any compression scheme. A rephrasing of this question is to ask whether a single algorithm 
can exploit the compressibility properties of strings under any compression method, even 
if each string is compressed using a different compression. In this paper we set out to 
answer this question by using straight-line programs. These provide a generic platform 
for representing many popular compression schemes including the LZ-family, Run-Length 
Encoding, Byte-Pair Encoding, and dictionary methods. 

For two strings of total length N having straight-line program representations of 
total size n, we present an algorithm running in 0(n 1A N 1 ' 2 ) time for computing the edit- 
distance of these two strings under any rational scoring function, and an 0(n 1,34 N 1 ' 34 )- 
time algorithm for arbitrary scoring functions. This improves on a recent algorithm of 
Tiskin that runs in OfoN 1 ' 5 ) time, and works only for rational scoring functions. 
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1. Introduction 

The edit distance between two strings over a fixed alphabet E is the minimum cost 
of transforming one string into the other via a sequence of character deletion, insertion, 
and replacement operations [31]. The cost of these elementary editing operations is given 
by some scoring function which induces a metric on strings over S. The simplest and 
most common scoring function is the Levenshtein distance [16] which assigns a uniform 
score of 1 for every operation. Determining the edit-distance between a pair of strings is a 
fundamental problem in computer science in general, and in combinatorial pattern matching 
in particular, with applications ranging from database indexing and word processing, to 
bioinformatics [11]. 

The standard dynamic programming solution for computing the edit distance between a 
pair of strings A = a\Q2 ■ ■ ■ a at and B = 6162 • • • 6jv involves filling in an (iV+1) x (iV+1) table 
T, with T[i,j] storing the edit distance between a±a2 ■ ■ ■ a% and &162 ■ ■ ■ bj. The computation 
is done according to the base-case rules given by T[0, 0] = 0, T[i, 0] = T[i— 1, 0]+ the cost of 
deleting a^, and T[0, j] = T[0,j — 1]+ the cost of inserting bj, and according to the following 
dynamic programming step: 



Note that as T has (N + l) 2 entries, the time-complexity of the algorithm above is 0(N 2 ). 

Compression is traditionally used to efficiently store data. In this paper, we focus on 
using compression to accelerate the dynamic-programming solution for the edit-distance 
problem described above. The basic idea is to first compress the strings, and then com- 
pute the edit distance between the compressed strings. Note that the "acceleration via 
compression" approach has been successfully applied also to other classical problems on 
strings. Various compression schemes, such as LZ77 [33], LZW-LZ78 [32], Huffman coding, 
Byte-Pair Encoding (BPE) [27], Run-Length Encoding (RLE), were employed to accelerate 
exact string matching [3, 13, 17, 20, 28], subsequence matching [9], approximate pattern 
matching [2, 12, 13, 24], and more [23]. 

Regarding edit-distance computation, Bunke and Csirik presented a simple algorithm 
for computing the edit-distance of strings that compress well under RLE [8] . This algorithm 
was later improved in a sequence of papers [5, 6, 10, 19] to an algorithm running in time 
0(nN), for strings of total length N that encode into run-length strings of total length 
n. In [10], an algorithm with the same time complexity was given for strings that are 
compressed under LZW-LZ78, where n again is the length of the compressed strings. Note 
that this algorithm is also 0{N 2 / \gN) in the worst-case for any strings over constant-size 
alphabets. 

The first paper to break the quadratic time-barrier of edit-distance computation was the 
seminal paper of Masek and Paterson [21], who applied the " Four- Russians technique" to 
obtain a running-time of 0(N 2 /IgN) for any pair of strings, and of 0(N 2 /lg 2 N) assuming 
a unit-cost RAM model. Their algorithm essentially exploits repetitions in the strings to 
obtain the speed-up, and so in many ways it can also be viewed as compression-based. In 
fact, one can say that their algorithm works on the "naive compression" that all strings over 
constant-sized alphabets have. A drawback of the the Masek and Paterson algorithm is that 
it can only be applied when the given scoring function is rational. That is, when all costs of 



T[i — l,j] + the cost of deleting Oj 
T[i,j] = min < T[i,j — 1] + the cost of inserting bj 

T[i — l,j — 1] + the cost of replacing a« with bj 



(1.1) 
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editing operations are rational numbers. Note that this restriction is indeed a limitation in 
biological applications, where PAM and evolutionary distance similarity matrices are used 
for scoring [10, 21]. For this reason, the algorithm in [10] mentioned above was designed 
specifically to work for arbitrary scoring functions. We mentioned also Bille and Farach- 
Colton [7] who extend the Masek and Paterson algorithm to general alphabets. 

There are two important things to observe from the above: First, all known techniques 
for improving on the 0(N 2 ) time bound of edit-distance computation, essentially apply ac- 
celeration via compression. Second, apart from RLE, LZW-LZ78, and the naive compression 
of the Four-Russians technique, we do not know how to efficiently compute edit-distance 
under other compression schemes. For example, no algorithm is known which substantially 
improves 0(N 2 ) on strings which compress well under LZ77. Such an algorithm would be 
interesting since there are various types of strings that compress much better under LZ77 
than under RLE or LZW-LZ78. In light of this, and due to the practical and theoretical 
importance of substantially improving on the quadratic lower bound of string edit-distance 
computation, we set out to answer the following question: 

"Is there a general compression-based edit-distance algorithm that can ex- 
ploit the compressibility of two strings under any compression scheme?" 

A key ingredient to answering this question, we believe, lies in a notion borrowed from the 

world of formal languages: The notion of straight-line programs. 



1.1. Straight-line programs 

A straight-line program, (SLP) is a context-free grammar generating exactly one string. 
Moreover, only two types of productions are allowed: Xi — > a where a is a unique terminal, 
and Xi — > XpX q with i > p,q where X\, . . . , X n are the grammar variables. Each variable 
appears exactly once on the left hand side of a production. The string represented by a 
given SLP is a unique string corresponding to the last nonterminal X n . We define the size 
of an SLP to be n, the number of variables (or productions) it has. The length of the strings 
that is generated by the SLP is denoted by N. It is important to observe that many SLPs 
can be exponentially smaller than the string they generate. 

Example 1.1. Consider the string abaababaabaab. It could be generated by the following 
SLP, also known as the Fibonacci SLP: 



X! 

x 2 

X A 

x 5 
x e 
x 7 



b 
a 

X2X1 
X3X2 

X5X4 
XqX^ 




Rytter [25] proved that the resulting encoding of most compression schemes including 
the LZ-family, RLE, Byte-Pair Encoding, and dictionary methods, can be transformed to 
straight-line programs quickly and without large expansion 1 . In particular, consider an 



^Important exceptions of this list are statistical compressors such as Huffman or arithmetic coding, as 
well as compressions that are applied after a Burrows- Wheeler transformation. 
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LZ77 encoding [33] with n! blocks for a string of length N . Rytter's algorithm produces an 
SLP-representation with size n = 0{n' log N) of the same string, in 0(n) time. Moreover, 
n lies within a log N factor from the size of a minimal SLP describing the same string. This 
gives us an efficient logarithmic approximation of minimal SLPs, since computing the LZ77 
encoding of a string can be done in linear-time. Note also that any string compressed by 
the LZ78-LZW encoding can be transformed directly into a straight-line program within a 
constant factor. 

1.2. Our results 

Due to Rytter's results, SLPs are perfect candidates for achieving our goal of generaliz- 
ing compression-based edit-distance algorithms. Indeed, a fast edit-distance algorithm for 
strings that have small SLP representations, would give a fast algorithm for strings which 
compress well under the compression schemes generalized by SLPs. Note that since con- 
structing the strings generated by the SLPs requires linear-time in the length of the strings, 
an 0(N 2 ) algorithm is available via the standard dynamic-programming formulation (1.1). 
The main result of this paper gives an algorithm which beats this bound: 

Theorem 1.2. Let A and B be two SLPs of total size n that respectively generate two string 
A and B of total size N . Then, given A and B, one can compute the edit-distance between 
A and B in 0[n N 1 ' 2 ) time for any rational scoring function. 

We can remove the dependency of rational scoring schemes in Theorem 1.2, recalling 
that arbitrary scoring schemes are important for biological applications. We obtain the 
following secondary result for arbitrary scoring functions: 

Theorem 1.3. Let A and B be two SLPs of total size n that respectively generate two string 
A and B of total size N . Then, given A and B, one can compute the edit-distance between 
A and B in 0(n L34 iV L34 ) time for any arbitrary scoring function. 

In the last part of the paper, we explain how the four-russians technique can also be 
incorporated into our SLP edit-distance scheme. We obtain a very simple algorithm that 
matches the performance of [10] in the worst-case. That is, we obtain a four-russian like 
algorithm with an fi(lgiV) speed-up which can handle arbitrary scoring functions, unlike 
the Masek and Paterson algorithm which works only for rational functions. We add this 
algorithm to our presentation not only for its practical importance, but also to emphasize 
the fact that SLPs provide a framework which allows an almost perfect generalization of 
compression-based edit-distance algorithms. 

1.3. Related Work 

Rytter et al. [14] was the first to consider SLPs in the context of pattern matching, 
and other subsequent papers also followed this line [15, 22]. In [25] and [17] Rytter and 
Lifshits took this work one step further by proposing SLPs as a general framework for 
dealing with pattern matching algorithms that are accelerated via compression. However, 
the focus of Lifshits was on determining whether or not these problems are polynomial in n 
or not. In particular, he gave an 0(n 3 )-time algorithm to determine equality of SLPs [17], 
and he established hardness for the edit distance [18], and even for the hamming distance 
problems [17]. Nevertheless, Lifshits posed as an open problem the question of whether or 
not there is an 0(nN) edit-distance algorithm for SLPs. Here, our focus is on algorithms 
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which break the quadratic 0(N 2 ) time-barrier, and therefore all algorithms with running- 
times between 0(nN) and 0(N 2 ) are interesting for us. 

Recently, Tiskin [29] gave an 0(nA^ 1,5 ) algorithm for computing the longest common 
subsequence between two SLPs, an algorithm which can be extended at constant-factor 
cost to compute the edit-distance between the SLPs under any rational scoring function. 
Observe that our algorithm for arbitrary scoring functions in Theorem 1.3 is already faster 
than Tiskin's algorithm for most values of N and n. Also, it has the advantage of being 
much more simpler to implement. As for our main algorithm of Theorem 1.2, our faster 
running-time is achieved also by utilizing some of the techniques used by Tiskin in a more 
elaborate way. 

2. The DIST Table 

The central dynamic-programming tool we use in our algorithms is the DIST table, a 
simple and handy data-structure which was originally introduced by Apostolico et al. [4], 
and then further developed by others in [10, 26]. In the following section we briefly review 
basic facts about this tool that are essential for understanding our results, following mostly 
the presentation in [10]. We begin with the so-called dynamic-programming grid, a graph 
representation of edit-distance computation on which DIST tables are defined. 

Consider the standard dynamic programming formulation (1.1) for computing the 
edit-distance between two strings A = a\a2---aN and B = &1&2 ■■■6jv- The dynamic- 
programming grid associated with this program, is an acyclic-directed graph which has a 
vertex for each entry of T (see Figure 1). The vertex corresponding to T[i,j] is associ- 
ated with a,i and bj, and has incoming edges according to (1.1) - an edge from T[i — 
whose weight is the cost of deleting a^, an edge from T[i,j — 1] whose weight is the cost 
of inserting bj, and an edge from T[i — l,j — 1] whose weight is the cost of replacing cij 
with bj. The value at the vertex corresponding to T[i,j] is the value stored in T[i,j], i.e. 
the edit-distance between the length i prefix of A and the length j prefix of B. Using 
the dynamic-programming grid G, we reduce the problem of computing the edit-distance 
between A and B to the problem of computing the weight of the lightest path from the 
upper-left corner to bottom-right corner in G. 



ABCD + ABCD 




0[4\ 



Figure 1: A subgraph of a Levenshtein distance dynamic program graph. On the left, 
DIST[4,4] (in bold) gives the minimum-weight path from I[4] to 0[4]. On the 
right, the value 9 of 0[4] is computed by minil[i] + DIST[i,4\. 
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We will work with sub-grids of the dynamic-programming grid that will be referred to 
as blocks. The input vertices of a block are all vertices in the first row and column of the 
block, while its output vertices are all vertices in the last row and column. Together, the 
input and output vertices are referred to as the boundary of the block. The substrings of 
A and B associated with the block are defined in the straightforward manner according 
to its first row and column. Also, for convenience purposes, we will order the input and 
output vertices, with both orderings starting from the vertex in bottom-leftmost corner of 
the block, and ending at the vertex in the upper-rightmost corner. The ith input vertex 
and jth output vertex are the ith and jth vertices in these orderings. We next give the 
definition of DIST tables, defined over blocks of G. 

Definition 2.1 (DIST [4]). Let G' be a block in G with x input vertices and x output 
vertices. The DIST table corresponding to G' is an x x x matrix, with DIST[i,j] storing 
the weight of the minimum-weight path from the ith input to the jth output in G, and 
otherwise oo if no such paths exists. 

It is important to notice that the values at the output vertices of a block are completely 
determined by that values at its input and its corresponding DIST table. In particular, if 
I[i] and 0[j] are the values at the ith input vertex and jth output vertex of a block G' of 
G, then 

0[j] = min I\i]+DIST[i,j]. (2.1) 

l<i<a; 

Equation 2.1 implies not only the input-output relation of the dynamic-programming values 
of a block, but also that the values at the output vertices can be computed in linear time from 
the values at the input vertices. Indeed, by (2.1), the values at the output vertices of G' are 
given by the column minima of the matrix I + DIST. Furthermore, by a simple modification 
of all oo values in I + DIST, we get what is known as a totally-monotone matrix [10]. Now, 
Aggarwal et al. [1] gave a simple recursive algorithm, nicknamed SMAWK in the literature, 
that computes all column minima of an x x x totally-monotone matrix by querying only 
0(x) elements of the matrix. It follows that using SMAWK we can compute the output 
values of G' in 0(x) time. 

Let us now discuss how to efficiently construct the DIST table corresponding to a block 
in G. Observe that this can be done quite easily in 0(x 3 ) time, for blocks with boundary 
size 0(x), by computing the standard dynamic-programming table between every prefix of 
A against B and every prefix of B against A. Each of these dynamic-programming tables 
contains all values of a particular row in the DIST table. In [4], Apostolico et al. show an 
elegant way to reduce the time-complexity of this construction to 0(x 2 \gx). In the case 
of rational scoring functions, the complexity can be further reduced to 0(x 2 ) as shown by 
Schmidt [26]. 

3. Acceleration via Straight-Line Programs 

In the following section we describe a generic framework for accelerating the edit dis- 
tance computation of two strings which are given by their SLP representation. This frame- 
work will later be used for explaining all our algorithms. We will refer throughout the paper 
to this framework as the block edit- distance procedure. 

Let A and B be two SLP representations of a pair of strings A and B, and for ease 
of presentation assume that |*4| = \B\ = n and \A\ = \B\ = N. Recall the definition in 
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Section 2 for the dynamic-programming grid corresponding to A and B. The general idea 
behind the block edit-distance procedure is to partition this grid into disjoint blocks, and 
then to compute the edit-distance between A and B at the cost of computing the values at 
the boundary vertices of each block. This is achieved by building in advance a repository 
containing all DIST tables corresponding to blocks in the partition. To efficiently construct 
this repository, we show how to partition the grid in a way which induces many block repeats. 
This is possible by utilizing substring repeats in A and B that are captured in A and B, 
and imply block repeats in the partitioning of G. The edit-distance of A and B is then 
computed by propagating the dynamic programming values at the boundary vertices of the 
blocks using the DIST tables in the repository and SMAWK. Before giving a complete 
description of this algorithm, we need to introduce the notion of xy-partition. 

Definition 3.1 (xy-partition). An xy-partition is a partitioning of G into disjoint blocks 
such that every block has boundary of size 0(x), and there are 0(y) blocks in each row and 
column. In addition, we require each pair of substrings of A and B associated with a block 
to be generated by a pair of SLP variables in A and B. 




An xy-partition of an edit distance graph 
for two SLPs generating the strings 
"ABCABCAB" and "BBCBBBCB". The 
white blocks are the ones of the partition 
and their corresponding SLP variables are 
marked in bold. Notice that there are nine 
blocks in the partition but only six of them 
are distinct. 



Figure 2: An xy-partition. 



An xy-partition of G is a partition with a specific structure, but more importantly, one 
where each substring is generated by a unique SLP variable of A and B. This latter require- 
ment allows us to exploit the repetitions of A and B captured by their SLPs. We next give 
a complete description of the block edit distance procedure. It assumes an xy-partition of 
G has already been constructed. Section 4 explains how to construct such partitions. 



Block Edit Distance 

(1) Construct a repository with the DIST tables corresponding to each block in the 
xy-partition. 

(2) Fill-in the first row and column of G using the standard base-case rules. 

(3) In top-to-bottom and left-to-right manner, identify the next block in the partition 
of G and use its input and the repository to compute its output using (2.1). 

(4) Use the outputs in order to compute the inputs of the next blocks using (1.1). 

(5) The value in the bottom-rightmost cell is the edit distance of A and B. 
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Apart from the repository construction in step 1, all details necessary for implementing 
the block edit-distance procedure are by now clear. Indeed, steps 2 and 5 are trivial, and 
step 4 is done via the standard dynamic-programming formulation of (1.1). Furthermore, 
the SMAWK computation of output values of a block, given its input values plus its cor- 
responding DIST table (step 3), is explained in Section 2. We next show that, as we are 
working with xy-partitions where each block is associated with an SLP variable, we can 
compute a repository containing all DIST necessary which is rather small. 

The first crucial observation for this, is that any two blocks associated with the same 
pair of substrings A' and B' have the same DIST table. This is immediate since any such 
pair of blocks have identical edge- weights. 

Observation 3.2. A pair of substrings A', B' uniquely identify the DIST table of a block. 

Since we required each substring in the xy-partition of G to be generated by some SLP 
variable, the above observation actually suggests that the number of different DIST tables 
is bounded by the number of variable pairs X G A and Y G B: 

Observation 3.3. The number of different DIST tables corresponding to any xy-partition 
is 0{n 2 ). 

Therefore, combining the two observations above, we know that a repository containing 
a DIST tables for each SLP variable pair X <E A and Y <E B will not be too large, and that 
it will contain a table corresponding to each block in our given xy-partition at hand. We 
can therefore state the following lemma: 

Lemma 3.4. The block edit-distance procedure runs in 0(n 2 x 2 lgx + Ny) time. 

Proof. We analyze the time complexity of each step in the block edit-distance procedure 
separately. Step 1 can be performed in 0(n 2 x 2 lgx) time, as we can construct every DIST 
table in 0(x 2 lgx) time (see Section 2), and the total number of such distinct matrices is 
0(n 2 ). Step 2 can be done trivially in O(N) time. Then, step 3 takes 0(x) time per block 
by using the SMAWK algorithm as explained in Section 2. Step 4 also takes 0(x) time per 
block as it only computes the values in the 0[x) vertices adjacent to the output vertices. 
The total time complexity of steps 3 and 4 is thus equal to the total number of boundary 
vertices in the xy-partition of G, and therefore to 0{Ny). Accounting for all steps together, 
this gives us the time complexity stated in the lemma. ■ 

4. Constructing an xy-partition 

In this section we discuss the missing component of Section 3, namely the construction 
of xy-partitions. In particular, we complete the proof of Theorem 1.3 by showing how to 
efficiently construct an xy-partition where y = 0{nN/x) for every x < N. Together with 
Lemma 3.4, this implies an 0{nzNz lg3 N) = 0(n 1,34 A^ 1,34 ) time algorithm for arbitrary 

2 1 

scoring functions by considering x = Nz/(n\g TV) 3. In the remainder of this section we 
prove the following lemma. 

Lemma 4.1. For every x < N there exists an xy -partition with y = 0(nN/x). Moreover, 
this partition can be found in O(N) time. 
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0(n) 



The key-vertices Vi and Vj+i both 
generate substrings of length Q(x), 
and their least common ancestor is 
Uj. The white vertices "hanging" 
of the Vi-to-Vi+i path are the added 
key-vertices. Together, they gener- 
ate the substring that lies between 
the substrings of Vi and Uj+i. 



0{x) 



0(x) 



Figure 3: A closer look on the 



parse tree of an SLP A. 



To prove the lemma, we show that for every SLP A generating a string A and every 
x < N, one can partition A into 0{nN/x) disjoint substrings, each of length 0(x), such 
that every substring is generated by some variable in A. This defines a subset of variables 
in both input SLPs which together defined our desired xy-partition. To partition A, we 
first identify 0(N/x) grammar variables in A each generating a disjoint substring of length 
between x and 2x. We use these variables to partition A. We then show that the substrings 
of A that are still not associated with a variable can each be generated by O(n) additional 
variables. Furthermore, these O(n) variables each generate a string of length bounded by 
x. We add all such variables to our partition of A for a total of 0(nN/x) variables. 

Consider the parse tree of A. We want to identify 0(nN/x) key-vertices such that every 
key- vertex generates a substring of length O(x), and A is a concatenation of substrings 
generated by key-vertices. We start by marking every vertex v that generates a substring 
of length greater than x as a key- vertex iff both children of v generate substrings of length 
smaller than x. This gives us t < N/x key- vertices so far, each generating a substring of 
length 0(x) (see Figure 3). But we are still not guaranteed that these vertices cover A 
entirely. 

To fix this, consider the ordering v\,V2, ■ ■ ■ ,ve on the current key- vertices induced by 
a left-to-right postorder traversal of the parse tree. This way, i>j+i is "to the right of" V{. 
If every Vi generates the substring Ai then A = A' 1 AiA' 2 A2 ■ ■ ■ A'^A£A' e+1 , where every Ai 
is of length 0(x), and every A[ is the "missing" substring of A that lies between and 
Ai. We now show that every N { is a concatenation of substrings of length smaller than x 
generated by at most O(n) vertices. 

Let Ui be the lowest common ancestor of vi and Vi+i and let Pi (resp. -Pj+i) be the 
unique path between Ui and Vi (resp. Vj+i). For every vertex v £ Pi — {u{\ such that 
■u's left child is also in Pi mark v 's right child as a key-vertex. Similarly, for every vertex 
v £ Pi+i — {ui} such that u's right child is also in Pj+i mark w's left child as a key-vertex. 
It is easy to verify that A[ is the concatenation of substrings generated by these newly 
marked key-vertices. There are at most 2n of these key-vertices since the depth of the parse 
tree is bounded by the number of different SLP variables. Moreover, they each generate a 
substring of length smaller than x for the following reason. Assume for contradiction that 
one of them generates a string of length greater than x. This would imply the existence of 
some vertex between Vi and Vi + \ in the v±,V2, ■ ■ ■ ,vg ordering. 
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To conclude, we showed that A = A' 1 AiA' 2 A2 ■ ■ ■ A' e AeA' f+1 where t < N/x, every Ai 
is of length 0(x) and is generated by one vertex, and every A\ is a concatenation of 0(n) 
substrings each of length smaller than x and generated by one vertex. Overall, we get that 
y = 0(n£) = 0(nN/x) vertices suffice to generate A for every x < N. It is easy to see 
that we can identify these vertices in 0{N) time thus proving Lemma 4.1. By choosing 

2 1 

x = Nz/{n\gN)z, and using the block edit distance time complexity of Lemma 3.4, this 
implies an 0(n 1,34 ./V L34 ) time algorithm for arbitrary scoring functions. 

5. Improvement for Rational Scoring Functions 

In this section we show that in the case of rational scoring functions, the time complexity 
of the block edit distance procedure can be reduced substantially by using a recursive 
construction of the DIST tables. In particular, we complete the proof of Theorem 1.2 by 
showing that in this case the repository of DIST tables can be computed in 0(n 2 x L5 ) time. 
This implies an 0(ri L4 iV 1 - 2 ) time algorithm for rational scoring functions by considering 
x = N°' 8 /n 0A and the xy-partition with y = nN/x. 

Before we describe how to compute the repository in 0(n 2 x L5 ) time, we need to in- 
troduce some features that DIST tables over rational scoring functions have. The first 
property, discovered by Schmidt [26] , is what is known as the succinct representation prop- 
erty: Any xxx DIST table can be succinctly stored using only 0(x) space. This follows 
from considering the vector obtained by subtracting a DIST column from the column to 
its right, and observing that this vector has only a constant number of value changes. The 
second property is that succinct representations allow to efficiently merge two DIST tables. 
That is, if D\ and D 2 are two DIST tables, one between a pair of substrings A' and B' 
and the other between A' and B", then we refer to the DIST table between A' and B'B" 
as the product of merging D\ and D 2 . A recent important result of Tiskin [30] shows how 
to utilize the succinct representation of DIST tables in order to merge two succinct xxx 
DIST tables in 0(x L5 ) time. 

Lemma 5.1. The block edit distance algorithm runs in 0(n 2 x L5 + Ny) time in case the 
underlying scoring function is rational. 

Proof. To prove the lemma it suffices to show how to compute the repository of DIST tables 
in step 1 of the block edit-distance procedure in 0(n 2 x L5 ) time, in case the underlying 
scoring function is rational. We will work with succinct representations of the DIST tables 
as described above. Say X — > X p X q and Y — > Y s Y t are two rules in the SLPs A and B 
respectively. To compute the DIST table that corresponds to the strings generated by X 
and Y, we first recursively compute the four DIST tables that correspond to the pairs 
(X p ,Y s ), (X p ,Yt), (X q ,Y s ), and (X q ,Yt). We then merge these four tables to obtain the 
DIST table that corresponds to (X, Y). To do so we use Tiskin's procedure to merge 
(X p ,Y s ) with (X p ,Y t ) into (X p ,Y s T t ), then merge (X q ,Y s ) with (X q ,Y t ) into (X q ,Y s T t ), 
and finally we merge (X p ,Y s Tt) and (X q ,Y s Tt) into (X p X q ,Y s Tt) = (X, Y). This recursive 
procedure computes each succinct DIST table by three merge operations, each taking 
0{x 1 - b ) time and 0{x) space. Since the number of different DIST tables is bounded by 
0(n 2 ), the 0(n 2 x 1 - 5 ) time for constructing the repository follows. ■ 
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To conclude, we have shown an 0(n 2 x L5 + Ny) time algorithm for computing the edit 
distance. Using the xy-partition from Lemma 4.1 with x = N°' s /n 0A and y = nN/x, we 
get a time complexity of 0(n N ) . 

6. Four-Russian Interpretation 

In the previous sections we showed how SLPs can be used to speed up the edit distance 
computation of strings that compress well under some compression scheme. In this section, 
we conclude the presentation of our SLP framework by presenting an f2(lg N) speed-up for 
strings that do not compress well under any compression scheme. To do so, we adopt the 
Four Russions approach of Masek and Paterson [21] that utilizes a naive property that every 
string over a fixed alphabet has. Namely, that short enough substrings must appear many 
times. However, while the Masek and Paterson algorithm can only handle rational scoring 
functions, the SLP version that we propose can handle arbitrary scoring functions. 

Consider a string A of length N over an alphabet S. The parse tree of the naive SLP 
A is a complete binary tree with N leaves 2 . This way, for every x < N we get that A is 
the concatenation of 0(N/x) substrings each of length 0(x) and each can be generated by 
some variable in A. This partition of A suggests an xy-partition in which y = N/x. At 
first glance, this might seem better than the partition guarantee of Lemma 4.1 in which 
y = nN/x. However, notice that in the naive SLP we have n > N so we can not afford to 
compute a repository of 0(n 2 ) DIST tables. 

To overcome this problem, we choose x small enough so that \^\ x , the number of 
possible substrings of length x, is small. In particular, by taking x = \ log| S N we get 

that the number of possible substrings of length x is bounded by = y/~N. This implies 
an xy-partition in which x = ^ log| S | N, y = N/x, and the number of distinct blocks n' is 
O(N). Using this partition, we get that the total construction time of the DIST repository 
is 0{n'x 2 lgx). Similar to Lemma 3.4, we get that the total running time of the block edit 
distance algorithm is 0{n'x 2 lgx + Ny) which gives 0(N 2 / lg N). 
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